Random forest Gini importance favors SNPs with large minor allele frequency

Authors

Anne-Laure Boulesteix

Andreas Bender

Justo Lorenzo Bermejo

Carolin Strobl

Abstract

The use of random forests is increasingly common in genetic association studies. The variable importance measure (VIM) that is automatically calculated as a by-product of the algorithm is often used to rank polymorphisms with respect to their association with the investigated phenotype. Here we investigate a characteristic of this methodology that may be considered an important pitfall, namely that common variants are systematically favored by the widely used Gini VIM. As a consequence, researchers may overlook rare variants that contribute to the missing heritability. The goal of the present paper is threefold: 1) to assess this effect quantitatively using simulation studies for different types of random forests (classical random forests and conditional inference forests, which employ unbiased variable selection criteria) as well as for different importance measures (Gini and permutation-based), 2) to explore the trees and to compare the behaviour of random forests and the standard logistic regression model in order to understand the statistical mechanisms behind the preference for common variants, and 3) to summarize our results and previously investigated properties of random forest VIMs in the context of association studies and to make practical recommendations regarding the methodological choice. The code implementing our study is available from the companion website: http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/020_professuren/boulesteix/ginibias/
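As a rough illustration of the kind of effect the abstract describes, the Python sketch below simulates SNPs that are unrelated to a binary phenotype and records the largest Gini impurity decrease each one achieves at a single root split. It is not the authors' simulation code, which is on the companion website. The function names, the sample size, the minor allele frequency (MAF) values, the 0/1/2 genotype coding, and the Hardy-Weinberg sampling are all assumptions made here for illustration. The sketch looks at one split only, not at a full forest or at deeper tree levels.

```python
import random


def gini(cases, total):
    """Gini impurity of a node holding `cases` positives among `total` samples."""
    if total == 0:
        return 0.0
    p = cases / total
    return 2.0 * p * (1.0 - p)


def best_gini_decrease(genotypes, labels):
    """Largest Gini decrease over the two ordinal cutpoints of a SNP coded 0/1/2
    (genotype <= 0 vs > 0, and genotype <= 1 vs > 1)."""
    n = len(labels)
    parent = gini(sum(labels), n)
    best = 0.0
    for cut in (0, 1):
        left = [y for g, y in zip(genotypes, labels) if g <= cut]
        right = [y for g, y in zip(genotypes, labels) if g > cut]
        if not left or not right:
            continue  # cutpoint unavailable: one child node would be empty
        children = (len(left) * gini(sum(left), len(left))
                    + len(right) * gini(sum(right), len(right))) / n
        best = max(best, parent - children)
    return best


def simulate_snp(maf, n, rng):
    """Draw n minor-allele counts (0, 1 or 2), assuming Hardy-Weinberg equilibrium."""
    return [(rng.random() < maf) + (rng.random() < maf) for _ in range(n)]


def mean_null_decrease(maf, n=200, reps=2000, seed=1):
    """Average best root-split Gini decrease for a SNP unrelated to the phenotype."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        # Phenotype is drawn independently of the genotype (null scenario).
        labels = [int(rng.random() < 0.5) for _ in range(n)]
        total += best_gini_decrease(simulate_snp(maf, n, rng), labels)
    return total / reps


if __name__ == "__main__":
    for maf in (0.05, 0.5):
        print(f"MAF={maf}: mean null Gini decrease = {mean_null_decrease(maf):.5f}")
```

In this toy setup the rare SNP (MAF 0.05) gets a smaller average decrease than the common one (MAF 0.5), even though neither carries any signal. One reason is that the rare SNP often has no minor-allele homozygotes in the sample, so only one of the two cutpoints is available to it.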

Related resources

Random forest Gini importance favours SNPs with large minor allele frequency: impact, sources and recommendations

Carolin Strobl, Torsten Hothorn, Achim Zeileis: Party on! A New, Conditional Variable Importance Measure for Random Forests Available in the party Package

Danger: High Power! – Exploring the Statistical Properties of a Test for Random Forest Variable Importance

Random forests have become a widely-used predictive model in many scientific disciplines within the past few years. Additionally, they are increasingly popular for assessing variable importance, e.g., in genetics and bioinformatics. We highlight both advantages and limitations of different variable importance scores and associated testing procedures. For the test of Breiman and Cutler (2008), w...

Zeileis: Danger: High Power! – Exploring the Statistical Properties of a Test for Random Forest Variable

Party on! A New

Random forests are one of the most popular statistical learning algorithms, and a variety of methods for fitting random forests and related recursive partitioning approaches is available in R. This paper points out two important features of the random forest implementation cforest available in the party package: The resulting forests are unbiased and thus preferable to the randomForest implemen...

Publication date: 2011
